-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add reproducer for consecutive RepartitionExec #18343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@gene-bordegaray : can you help review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Just want to ask clarifying question to make sure I am understanding:
- the Round Robin Repartition into the Aggregate is useful because we can disperse work across partitions and then accumulate their results. Using the aggregated results we can use the Hash Repartition to hand off work with the same key (such as env = 'prod') to workers, thus is more efficient
- the parquet query is not working this way as the Reparititons are not separated by the Aggregate. The Aggregate does all this work on a single partition then does Repartitioning too late.
Yes -- you can read more about the two phase aggregation in general here https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html#tymethod.state |
Correct. And the ticket of this reproducer #18341 proposes 3 different ways to improve it |
Ok awesome, I will take some time to understand these approaches and give some of my thoughts |
|
Thanks @NGA-TRAN @gene-bordegaray and @Dandandan |
Reproducer for #18341